Basics of Data Visualization and Causal Graphs

Our first steps with ggplot2 and ggdag

## The libraries we will use today
library(palmerpenguins) # To get our example's dataset
library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(ggplot2) # To visualize data (this package is also loaded by library(tidyverse))
library(ggdag) # To create our DAGs

Welcome

This week’s tutorial will be divided into two broad parts.

  • First, we will learn some basics of data visualization with ggplot.
  • Second, we will start our exploration of directed acyclic graphs (DAGs) for causal inference.

Introduction to ggplot2

ggplot2 is by far the most popular visualization package in R. It implements the grammar of graphics, which provides a versatile syntax for creating visuals. The underlying logic of the package relies on deconstructing the structure of graphs (if you are interested in this, you can read this article).

For the purposes of this introduction, we care about the layered nature of visualizing with ggplot2.


Our building blocks

During this week, we will learn about the following building blocks:

  • Data: the data frame, or data frames, we will use to plot
  • Aesthetics: the variables we will be working with
  • Geometric objects: the type of visualization
  • Theme adjustments: size, text, colors etc

Data

The first building block for our plots is the data we intend to map. In ggplot2, we always have to specify the object where our data live. In other words, you will always have to specify a data frame, as such:

ggplot(name_of_your_df)

In the future, we will see how to combine multiple data sources to build a single plot. For now, we will work under the assumption that all your data live in the same object.


Aesthetics

The second building block for our plots is the aesthetics. We need to specify which variables in the data frame we will be using and what role they play.

To do this, we will use the aes() function within the ggplot() function, after the data frame (remember to add a comma after the data frame).

ggplot(name_of_your_df, aes(x = your_x_axis_variable, y = your_y_axis_variable))

Beyond your axes, you can add more aesthetics representing further dimensions of the data in the two-dimensional graphic plane, such as size, color, and fill, to name but a few.
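As a quick sketch (using the penguins data we will meet below), color and size can each carry an extra dimension:

```r
library(ggplot2)
library(palmerpenguins) # provides the penguins data frame

# color separates the species; size scales with bill length
p <- ggplot(penguins, aes(x = flipper_length_mm,
                          y = body_mass_g,
                          color = species,
                          size = bill_length_mm)) +
  geom_point(alpha = 0.6) # transparency keeps overlapping points visible

p
```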


Geometric objects

The third layer to render our graph is a geometric object. To add one, we need to add a plus sign (+) at the end of the initial line and state the type of geometric object we want, for example, geom_point() for a scatter plot or geom_bar() for barplots.

ggplot(name_of_your_df, aes(x = your_x_axis_variable, y = your_y_axis_variable)) +
  geom_point()

Theme

At this point our plot may just need some final touches. We may want to fix the axes names or get rid of the default gray background. To do so, we need to add an additional layer preceded by a plus sign (+).

If we want to change the names of our axes, we can utilize the labs() function.

We can also employ some of the pre-loaded themes, for example, theme_minimal().

ggplot(name_of_your_df, aes(x = your_x_axis_variable, y = your_y_axis_variable)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Name you want displayed",
       y = "Name you want displayed")

Our first plot

For our very first plot using ggplot2, we will use the penguins data from last week.

We would like to create a scatterplot that illustrates the relationship between the length of a penguin’s flipper and its weight.

To do so, we need three of our building blocks: a) data, b) aesthetics, and c) a geometric object (geom_point()).

ggplot(penguins, aes(x = flipper_length_mm, y=body_mass_g)) +
  geom_point()


Activity 💡

Once we have our scatterplot, can you think of a way to adapt the code to:

    1. convey another dimension, the species of penguin, through color
    2. change the axes names
    3. render the graph with theme_minimal()

➨ Click for Answer 🔓
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Species")


Visualizing effectively

Plotting distributions

If we are interested in plotting distributions of our data, we can leverage geometric objects, such as:

  • geom_histogram(): visualizes the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin (the default is 30 bins).
  • geom_density(): computes and draws kernel density estimate, which is a smoothed version of the histogram.
  • geom_bar(): renders barplots and, when plotting distributions, behaves in a very similar way to geom_histogram() (it can also be used with two dimensions)
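For instance, a sketch of geom_bar() carrying a second dimension through fill (using the penguins data):

```r
library(ggplot2)
library(palmerpenguins)

# counts penguins per island, with species as a second (fill) dimension
p <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()

p
```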

This is a histogram presenting the weight distribution of penguins in our sample.

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram()


Activity 💡

Let’s adapt the code of our histogram above and explore what happens:

    1. add the bins = 15 argument (try different numbers)
    2. add fill = "#FF6666" (try "red" or "blue" instead of "#FF6666")
    3. change the geom to geom_density() and geom_bar()

Plotting relationships

We can utilize graphs to explore how different variables are related. In fact, we did so before in our scatterplot. We can also use box plots and lines to show some of these relationships.

For example, this boxplot showcases the distribution of weight by species:

ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot() +
  theme_minimal() +
  labs(x = "Species",
       y = "Body mass (g)")

Or this adaptation of our initial plot with a line of best fit for the observed data by each species:

ggplot(penguins, aes(x= flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + 
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(x = "Length of the flipper",
       y = "Body mass (g)",
       color = "Species")


Next steps

Now that you have been introduced to some of the basics of ggplot2, the best way to move forward is to experiment. As we have discussed before, the R community is very open. Perhaps, you can gather some inspiration from the Tidy Tuesday social data project in R where users explore a new dataset each week and share their visualizations and code on Twitter under #TidyTuesday. You can explore some of the previous visualizations here and try to replicate their code.

Here is a curated list of awesome ggplot2 resources.


Directed Acyclic Graphs (DAGs)

This week we learned that directed acyclic graphs (DAGs) are very useful to express our beliefs about relationships among variables.

DAGs are compatible with the potential outcomes framework. They give us a more convenient and intuitive way of laying out causal models. Next week we will learn how they can help us develop a modeling strategy.

Today, we will focus on their structure and some DAG basics with the ggdag package.


Creating DAGs in R

To create our DAGs in R, we will use the ggdag package.

The first thing we need to do is create a dagified object, that is, an object where we state our variables and the relationships among them. Once we have our DAG object, we just need to plot it with the ggdag() function.

Let’s say we want to re-create this DAG:

We would like to express the following links:

  • P -> D
  • D -> M
  • D -> Y
  • M -> Y

To do so in R with ggdag, we would use the following syntax:

dag_object <- ggdag::dagify(variable_being_pointed_at ~ variable_pointing,
                            variable_being_pointed_at ~ variable_pointing,
                            variable_being_pointed_at ~ variable_pointing)

After this, we would just run:

ggdag::ggdag(dag_object)

Let’s plot this DAG

our_dag <- ggdag::dagify(d ~ p,
                         m ~ d,
                         y ~ d,
                         y ~ m)

ggdag::ggdag(our_dag)


Activity 💡

See what happens when you add + theme_minimal(), + theme_void(), or + theme_dag() to the DAG above. What package do you think lies behind the mappings of ggdag?


Polishing our DAGs in R

As you may have seen, the DAG is not rendered with the nodes in the positions we want.

If you ever want to explicitly tell ggdag where to position each node, you can do so with Cartesian coordinates.

Let’s take P as the point (0,0):

coord_dag <- list(
  x = c(p = 0, d = 1, m = 2, y = 3),
  y = c(p = 0, d = 0, m = 1, y = 0)
)

our_dag <- ggdag::dagify(d ~ p,
                         m ~ d,
                         y ~ d,
                         y ~ m,
                         coords = coord_dag)

ggdag::ggdag(our_dag) + theme_void()


A more complex example

Let’s say we’re looking at the relationship between smoking and cardiac arrest. We might assume that smoking causes changes in cholesterol, which causes cardiac arrest:

smoking_ca_dag <- ggdag::dagify(cardiacarrest ~ cholesterol,
                                cholesterol ~ smoking + weight,
                                smoking ~ unhealthy,
                                weight ~ unhealthy,
                                labels = c("cardiacarrest" = "Cardiac\n Arrest", 
                                           "smoking" = "Smoking",
                                           "cholesterol" = "Cholesterol",
                                           "unhealthy" = "Unhealthy\n Lifestyle",
                                           "weight" = "Weight"),
                                latent = "unhealthy",
                                exposure = "smoking",
                                outcome = "cardiacarrest")

ggdag::ggdag(smoking_ca_dag, # the dag object we created
             text = FALSE, # this means the original names won't be shown
             use_labels = "label") + # instead use the new names
  theme_void()

In this example, we:

    1. Used more meaningful variable names
    2. Created a variable that is the result of two variables instead of just one (cholesterol)
    3. Used the labels argument to rename our variables (useful if your desired final variable name is more than one word)
    4. Specified which variable is latent, which is the exposure, and which is the outcome
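Declaring the exposure and outcome also lets ggdag work for us downstream. As a sketch (re-stating the DAG so the chunk is self-contained), ggdag_adjustment_set() draws the DAG with the minimal set of variables we would need to control for highlighted; we will dig into what adjustment sets mean for modeling next week.

```r
library(ggdag)

smoking_ca_dag <- dagify(cardiacarrest ~ cholesterol,
                         cholesterol ~ smoking + weight,
                         smoking ~ unhealthy,
                         weight ~ unhealthy,
                         latent = "unhealthy",
                         exposure = "smoking",
                         outcome = "cardiacarrest")

# plot the DAG with the minimal adjustment set(s) highlighted
p <- ggdag_adjustment_set(smoking_ca_dag)

p
```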

Common DAG path structures

In the DAG below, X acts as a mediator between D and Y (D -> X -> Y), alongside the direct path D -> Y:

coord_dag <- list(
  x = c(d = 0, x = 1, y = 2),
  y = c(d = 0, x = 1, y = 0)
)

our_dag <- ggdag::dagify(x ~ d,
                         y ~ d,
                         y ~ x,
                         coords = coord_dag)

ggdag::ggdag(our_dag) + theme_void()


Activity 💡

Let’s adapt the code to make X a confounder and a collider.


➨ Click for Answer 🔓


Confounder:

confounder_dag <- ggdag::dagify(d ~ x,
                                y ~ d,
                                y ~ x,
                                coords = coord_dag)

ggdag::ggdag(confounder_dag) + theme_void() 

Collider:

collider_dag <- ggdag::dagify(x ~ d,
                              y ~ d,
                              x ~ y,
                              coords = coord_dag)

ggdag::ggdag(collider_dag) + theme_void() 


DAGs and modeling (a brief introduction)

As we can remember from our lecture, there is a set of key rules for understanding how to employ DAGs to guide our modeling strategy.

  • A path is open or unblocked at non-colliders (confounders or mediators)
  • A path is (naturally) blocked at colliders
  • An open path induces a statistical association between two variables
  • Absence of an open path implies statistical independence
  • Two variables are d-connected if there is an open path between them
  • Two variables are d-separated if the path between them is blocked
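Since ggdag builds on the dagitty package, we can check these rules programmatically. A sketch with the DAG from earlier (p -> d, d -> m, d -> y, m -> y):

```r
library(ggdag)   # dagify()
library(dagitty) # dseparated()

our_dag <- dagify(d ~ p,
                  m ~ d,
                  y ~ d,
                  y ~ m)

# p and y are d-connected: the path p -> d -> y is open
dseparated(our_dag, "p", "y")      # FALSE

# conditioning on d blocks every path from p to y
dseparated(our_dag, "p", "y", "d") # TRUE
```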

Here it is once again, because it is very important

In this portion of the tutorial, we will demonstrate how different biases arise when we model our relationships of interest.


What happens when we control for a collider?

The case for beauty, talent, and celebrity

As our DAG showcases, we assume that earning celebrity status is a function of an individual’s beauty and talent.

We will simulate 1000 data points that reflect these assumptions. In our world, someone gains celebrity status if the sum of their units of beauty and talent is greater than 8.

# beauty - 1000 observations with mean 5 units of beauty and sd 1.5 (arbitrary scale)
beauty <- rnorm(1000, 5, 1.5)

# talent - 1000 observations with mean 3 units of talent and sd 1 (arbitrary scale)
talent <- rnorm(1000, 3, 1)

# celebrity - binary
celebrity_status <- ifelse(beauty + talent > 8, "Celebrity", "Not Celebrity") # celebrity if the sum of units is greater than 8

celebrity_df <- dplyr::tibble(beauty, talent, celebrity_status) # we make a df with our values

head(celebrity_df, 10)
## # A tibble: 10 × 3
##    beauty talent celebrity_status
##     <dbl>  <dbl> <chr>           
##  1   4.00   3.47 Not Celebrity   
##  2   4.17   4.73 Celebrity       
##  3   4.70   2.61 Not Celebrity   
##  4   5.01   2.01 Not Celebrity   
##  5   3.21   1.32 Not Celebrity   
##  6   6.22   4.27 Celebrity       
##  7   5.84   1.57 Not Celebrity   
##  8   2.32   3.34 Not Celebrity   
##  9   2.49   2.66 Not Celebrity   
## 10   4.36   2.97 Not Celebrity

In this case, as our simulation suggests, we have a collider structure: celebrity is a function of both beauty and talent. We can also infer from the way we defined the variables that beauty and talent are d-separated (i.e., the path between them is closed because celebrity is a collider).
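We can sanity-check this d-separation in the raw data: since beauty and talent were drawn independently, their unconditional correlation should be close to zero. A sketch (the seed here is a hypothetical one of ours, so the exact value will differ from the tutorial’s draws):

```r
set.seed(42) # hypothetical seed, only so the sketch is reproducible
beauty <- rnorm(1000, 5, 1.5)
talent <- rnorm(1000, 3, 1)

# independent draws, so the correlation is near zero
cor(beauty, talent)
```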

Say you are interested in researching the relationship between beauty and talent for your Master’s thesis. While doing your literature review, you encounter a series of papers that find a negative relationship between the two and state that more beautiful people tend to be less talented. The model these teams of researchers used was the following:

\[Y_{Talent} = \beta_0 + \beta_1Beauty + \beta_2Celebrity\]

Your scientific hunch makes you believe that celebrity is a collider and that, by controlling for it in their models, the researchers are inducing collider bias, or endogenous bias. You decide to move forward with your thesis by laying out a criticism of previous work in the field, given that you consider the formalization of their models erroneous. You utilize the same data previous papers used but, following your logic, do not control for celebrity status. This is what you find:

True model

true_model_celebrity <- lm(talent ~ beauty, data = celebrity_df)
summary(true_model_celebrity)
## 
## Call:
## lm(formula = talent ~ beauty, data = celebrity_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1212 -0.6345  0.0028  0.6685  3.4793 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.86566    0.11287  25.390   <2e-16 ***
## beauty       0.01951    0.02172   0.898    0.369    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.014 on 998 degrees of freedom
## Multiple R-squared:  0.0008074,  Adjusted R-squared:  -0.0001938 
## F-statistic: 0.8064 on 1 and 998 DF,  p-value: 0.3694
ggplot(celebrity_df, aes(x=beauty, 
                         y=talent)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Beauty",
       y = "Talent")

Biased model from previous literature

Let’s see:

biased_model_celebrity <- lm(talent ~ beauty + celebrity_status, data = celebrity_df)
summary(biased_model_celebrity)
## 
## Call:
## lm(formula = talent ~ beauty + celebrity_status, data = celebrity_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.82834 -0.49080  0.02614  0.51531  2.77059 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    5.44487    0.14274   38.15   <2e-16 ***
## beauty                        -0.33580    0.02314  -14.51   <2e-16 ***
## celebrity_statusNot Celebrity -1.59909    0.06833  -23.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8152 on 997 degrees of freedom
## Multiple R-squared:  0.3551, Adjusted R-squared:  0.3538 
## F-statistic: 274.5 on 2 and 997 DF,  p-value: < 2.2e-16
ggplot(celebrity_df, aes(x=beauty, y=talent, color = celebrity_status)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Beauty",
       y = "Talent",
       color = "")

As we can see, by controlling for a collider, the previous literature was inducing a non-existent association between beauty and talent, also known as collider or endogenous bias.
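The same bias appears without any regression: computing the correlation within each celebrity group (i.e., conditioning on the collider) induces a negative association that is absent overall. A sketch re-creating the simulation, with a hypothetical seed for reproducibility:

```r
set.seed(42) # hypothetical seed
beauty <- rnorm(1000, 5, 1.5)
talent <- rnorm(1000, 3, 1)
is_celebrity <- beauty + talent > 8

cor(beauty, talent)                               # near zero overall
cor(beauty[is_celebrity], talent[is_celebrity])   # negative among celebrities
cor(beauty[!is_celebrity], talent[!is_celebrity]) # negative among non-celebrities too
```

Selecting on the sum beauty + talent carves the cloud of points into slices where one variable can only be high if the other is low, which is exactly the spurious negative slope the biased model picks up.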


What happens when we fail to control for a confounder?

Shoe size and salary

# sex - replicate male and female 500 times each
sex <- rep(c("Male", "Female"), each = 500) 

# shoe size - random number with mean 38 and sd 4, plus 4 if the observation is male
shoesize <- rnorm(1000, 38, 2) +  (4 * as.numeric(sex == "Male"))

# salary - a random number with mean 25 and sd 2, plus 5 if the observation is male
salary <- rnorm(1000, 25, 2) + (5 * as.numeric(sex == "Male"))

salary_df <- dplyr::tibble(sex, shoesize, salary)

head(salary_df, 10)
## # A tibble: 10 × 3
##    sex   shoesize salary
##    <chr>    <dbl>  <dbl>
##  1 Male      40.9   26.8
##  2 Male      39.6   30.6
##  3 Male      41.7   29.6
##  4 Male      41.4   32.8
##  5 Male      42.3   29.7
##  6 Male      41.9   28.2
##  7 Male      45.1   28.5
##  8 Male      45.4   32.0
##  9 Male      43.0   30.7
## 10 Male      43.2   32.8

Say now one of your peers tells you about a new study suggesting that shoe size affects an individual’s salary. You are a bit skeptical and read it. The model these researchers apply is the following:

\[Y_{Salary} = \beta_0 + \beta_1ShoeSize\]

Your scientific hunch makes you believe that this relationship could be confounded by the sex of the respondent. You think that by failing to control for sex in their models, the researchers are inducing omitted variable bias. You decide to open their replication files and control for sex. This is what you find:

\[Y_{Salary} = \beta_0 + \beta_1ShoeSize + \beta_2Sex\]


True model

true_model_salary <- lm(salary ~ shoesize + sex, data = salary_df)
summary(true_model_salary)
## 
## Call:
## lm(formula = salary ~ shoesize + sex, data = salary_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7120 -1.3442  0.0303  1.4450  6.0751 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.53368    1.16975  20.973   <2e-16 ***
## shoesize     0.01035    0.03084   0.336    0.737    
## sexMale      5.05927    0.17977  28.144   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.021 on 997 degrees of freedom
## Multiple R-squared:  0.6151, Adjusted R-squared:  0.6143 
## F-statistic: 796.5 on 2 and 997 DF,  p-value: < 2.2e-16
ggplot(salary_df, aes(x=shoesize, y=salary, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Shoe size",
       y = "Salary",
       color = "")


Biased model from previous literature

biased_model_salary <- lm(salary ~ shoesize, data = salary_df)
summary(biased_model_salary)
## 
## Call:
## lm(formula = salary ~ shoesize, data = salary_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7350 -1.9593  0.0689  1.9028  7.4865 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.73462    1.17364    2.33     0.02 *  
## shoesize     0.62059    0.02936   21.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.706 on 998 degrees of freedom
## Multiple R-squared:  0.3092, Adjusted R-squared:  0.3085 
## F-statistic: 446.8 on 1 and 998 DF,  p-value: < 2.2e-16
ggplot(salary_df, aes(x=shoesize, y=salary)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(x = "Shoe size",
       y = "Salary")

As we can see, by failing to control for a confounder, the previous literature was creating a non-existent association between shoe size and salary, incurring omitted variable bias.


Acknowledgements

This tutorial is based largely on chapters 7 to 10 from the QPOLR book.

This script was drafted by Lisa Oswald and Sebastian Ramirez-Ruiz, with contributions by Adelaida Barrera Daza, Carolina Díaz Suárez, Sofía García-Durrer, and William Fernandez.